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Robust Estimates of Location and Dispersion 



Introduction 

During the last four of five decades, statisticians have devoted a great 
deal of attention to robust estimation. Indeed, Tukey has described the "almost 
parametric models" as a third wave of statistics, following upon parametric and 
non-parametric theory (Koenker and Bassett 1978, p. 35). Advances in computer 
hardware and algorithms now allow researchers to apply these statistics to large 
data sets and to use procedures that combine robustness with high efficiency. 

Researchers need to be aware of these developments because classical 
statistical methods are quite vulnerable to outliers in data. The average and the 
standard deviation can be broken down by just a few anomalous observations. In 
other words, our usual estimates of location and dispersion may have large 
biases if the outliers remain undetected. This possibility arises especially in large 
data sets which are processed entirely by computer without a personal 
inspection by the researcher. 

Aberrant observations may result from unusual events, measurement 
errors, faulty transcription, or simply from a noisy error distribution. The outliers 
themselves may be of considerable interest if they represent meaningful 
exceptions --for example, an academic program which produces outstanding test 
scores. Because the average is not a robust measure of location, outliers may 
not show up in the residuals scattered around the average. This situation may be 
aggravated by a masking effect, in which one anomalous observation hides 
several other outliers. It follows that scrutiny of the residuals may not be effective 
unless robust estimates of location and dispersion have been computed. 

The next section of this paper describes the remedian, a robust location 
measure for large data sets. Then a robust dispersion measure, Sn, is 
presented; and its role in LI regression is discussed. The appendices contain 
Basic programs for these statistics. The programs are based on algorithms 
developed by Gilbert Bassett, Jr., Cristophe Croux, Annick M. Leroy, and Peter 
J. Rousseeuw. 

Location: the remedian 

Robust statistics use more computer time and memory than their 
nonrobust least-squares counterparts, a fact which has impeded robust 
estimation for very large data sets. A sample average, for example, requires little 
storage since the individual observations need not be retained in memory; each 
is just added to the running sum and then discarded. Computation of a median, 
on the other hand, requires that all the observations be available in memory for 
sorting. 

Rousseeuw and Bassett (1990) proposed the remedian, an iterated 
median which needs a small amount of memory to process large data sets. In 
the Basic version (Appendix A), 17 observations are read at a time, and their 
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median is computed and stored in a cell. The observations are then discarded, 
and the next 17 data are read. These steps are repeated until the entire sample 
has been processed. Using just five arrays, each with 17 observations, the 
program can process as many as 17® = 1,419,857 numbers. 

How well does the remedian deal with outliers ? One important indicator 
of robustness is the breakdown point, the amount of contamination an estimator 
can withstand. In a sample of n observations, just one outlier, if it is bad enough, 
can carry the average and the standard deviation indefinitely far from their 
population values, so the breakdown point of those classical statistics is only 1/n. 
The sample median's bias, on the other hand, remains finite when as many as 
half the observations undergo arbitrary contamination. Therefore, the median's 
breakdown point approaches fifty percent. This is the maximum breakdown point 
for an affine-equivariant estimator; if more than half the sample is corrupted, the 
estimator cannot tell the good observations from the bad ones. 

Rousseeuw and Bassett find that the remedian's breakdown point can be 
low when the outliers happen to clump together in one or a few groups. If, on the 
other hand, the outliers are located randomly throughout the sample, the 
probability of breakdown is quite small. Furthermore, the authors show that the 
remedian estimates the population median consistently and is equivariant under 
monotonic transformations of the data. The remedian is therefore an attractive 
robust estimator of location in large samples, where storage of the individual 
observations is impractical. 

Dispersion: the Sn statistic 

Because it is based on least squares, the standard deviation has a low 
breakdown point (asymptotically zero); a few stray observations can make it 
explode. Alternatively, one can compute the average absolute residual around 
the median, which also fails to be a robust measure of dispersion. The median of 
the absolute residuals does achieve the maximum breakdown point and is fairly 
easy to compute, but its efficiency is low (only about 37 percent). Rousseeuw 
and Croux (1993) proposed an high-breakdown alternative to the standard 
deviation. Denoted Sn, the estimator is unbiased and has an asymptotic 
efficiency of about 58 percent. For n observations x(i), Sn is simply proportional 
to the double median over all n(n-1)/2 differences | x(i) - x(j) | . Accordingly, Sn 
resembles a mode or "nearest neighbor" statistic. As such, it resists skewness 
and outliers. For moderate or large n, the computations would be quite 
burdensome were it not for Rousseeuw and Croux's clever algorithm. 

Among the many applications of Sn is the question of a scale estimate for 
LI regression problems, those which minimize the sum of the absolute residuals 
instead of the sum of the squared residuals (Dodge 1987, Dodge 1992). The LI 
regression has low breakdown if there are bad leverage points (extreme 
observations) among the independent variables. In that case, the least median of 
squares criterion or one of its variants is preferable (Rousseeuw and Leroy 
1987). In other situations, however, bad leverage points are precluded, and the 
outliers are confined to the dependent variable. This is true for polynomial trends 
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and for many designed experiments, including ANOVA models. In these cases, 
the LI regression attains the maximum breakdown point. For large samples and 
numerous independent variables, the LI regression is also easier to compute 
than the least median of squares since the former can be solved as an ordinary 
linear program or by special algorithms (Dodge 1992). 

What role does the Sn statistic play here ? Bassett and Koenker (1978) 
showed that the large-sample covariance matrix for LI regression differs from 
least squares only in the dispersion parameter. A principal obstacle to robust 
inference for the LI regression has therefore been the choice of a reasonably 
efficient, high-breakdown alternative to the standard deviation. This issue has 
been examined by McKean and Schrader (1987) and Koenker (1987), among 
other authors. 

It is well known that the LI regression always produces several zero 
residuals, as many as there are free parameters in the model. The zero residuals 
represent the observations "used up" in the estimation process. It seems natural 
to discard those zeros and then apply Sn to the nonzero residuals. The resulting 
"standard error of the estimate" would have large-sample validity when used in 
the usual regression and ANOVA hypothesis tests. 

Appendix B contains a Basic program to compute Sn. The online resource 
STATLIB has Rousseeuw and Croux's FORTRAN version of Sn. 
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Appendix A. A Basic Program for the Remedian 



REM 

REM 

REM 

REM 

REM 

REM 

REM 

REM 

REM 

REM 

REM 

REM 

REM 

REM 

REM 

REM 

REM 

REM 

REM 

REM 

REM 

080 

090 

100 

110 

120 

130 

140 

150 

160 

170 

180 

182 

190 

192 

200 

210 

220 

230 

240 

242 

250 

260 

270 

280 

290 

300 

310 

320 

330 

332 

334 

340 



THIS PROGRAM COMPUTES THE REMEDIAN, A MEDIAN FOR 
LARGE DATA SETS. THE REMEDIAN, WHICH USES A 
SMALL AMOUNT OF COMPUTER STORAGE, WAS PROPOSED BY: 
PETER J. ROUSSEEUW AND GILBERT W. BASSETT, JR. 
(1990), "THE REMEDIAN: A ROBUST AVERAGING 
METHOD FOR LARGE DATA SETS," JOURNAL OF THE 
AMERICAN STATISTICAL ASSOCIATION, 85, 97-104. 

THE CURRENT VERSION OF THIS PROGRAM CAN HANDLE 
AS MANY AS 17^5 = 1,419,857 OBSERVATIONS. 

THE FUNCTION FMED, WHICH FINDS AN ORDER STATISTIC, 
WAS CREATED BY DR. ANNICK LEROY AND APPEARS IN 
"PROGRESS," A FORTRAN PROGRAM FOR ROBUST REGRESSION. 
THE CODE FOR SORTING A VECTOR OF DATA (LINES 600- 
834) WAS ALSO ADAPTED FROM THE "PROGRESS" PROGRAM. 

THIS PROGRAM CARRIES NO WARRANTY, EXPRESSED OR 
IMPLIED. THE PROGRAM'S SUITABILITY FOR COMMERCIAL 
USE OR FOR ANY PARTICULAR APPLICATION IS NOT 
GUARANTEED . 

DEFLNG I-N 

DECLARE FUNCTION FMED (Y ( ) ) 

PRINT "WHAT IS THE INPUT FILE ?" 

INPUT INFILE$ 

PRINT "WHAT IS THE OUTPUT FILE ?" 

INPUT OUTFILE$ 

PRINT "HOW MANY OBSERVATIONS (<= 1,419,857) ?■ 

INPUT NOBS 

OPEN INFILE$ FOR INPUT AS #1 

OPEN OUTFILE$ FOR OUTPUT AS #2 

DIM A1 (17) , A2 (17) ,A3 (17) ,A4 (17) ,A5 (17) 

DIM W(85) ,V(85) , JLV (85), JRV (85) 

1 = 0 

FOR N = 1 TO 17 
FOR M = 1 TO 17 
FOR L = 1 TO 17 
FOR K = 1 TO 17 
FOR J = 1 TO 17 
1 = 1 + 1 

IF I > NOBS GOTO 340 
INPUT #1, ADATUM 
A1(J) = ADATUM 
NEXT J 

A2(K) = FMED(AIO) 

NEXT K 

A3 (L) = FMED (A2 () ) 

NEXT L 

A4(M) = FMED (A3 ()) 

NEXT M 

A5 (N) = FMED (A4 ( ) ) 

NEXT N 
JW = J-l 



A2 



350 

360 

370 

372 

380 

390 

400 

410 

420 

430 

440 

450 

460 

470 

480 

490 

500 

510 

520 

530 

540 

550 

560 

570 

580 

582 

584 

586 

588 

590 

592 

600 

610 

620 

630 

640 

650 

660 

670 

680 

690 

700 

710 

720 

730 

740 

750 

760 

770 

772 

780 

782 

790 

792 



KW = K-l 
LW = L-l 
MW = M-l 
NW = N-l 



11=0 
FOR I = 
11 = 11+1 
V(II) = 
W(II) = 
NEXT I 
FOR I = 
11 = 11+1 
V(II) = 
W(II) = 
NEXT I 
FOR I = 
11 = 11+1 
V(II) = 
W(II) = 
NEXT I 
FOR I = 
11 = 11+1 
V(II) = 
W(II) = 
NEXT I 
FOR I = 



1 TO JW 

A1(I) 

1.0 

1 TO KW 

A2 (I) 

17.0 

1 TO LW 

A3 (I) 

289.0 

1 TO MW 

A4 (I) 

4913.0 

1 TO NW 



11=11+1 
V(II) = A5 (I) 

W(II) = 83521.0 
NEXT I 

NTOT = JW+KW+LW+MW+NW 

JSS = 1 

JLV(l) = 1 

JRV(l) = NTOT 

JNDL = JLV(JSS) 

JR = JRV(JSS) 

JSS = JSS - 1 
JNC = JNDL 
J = JR 

JTWE = (JNDL + JR) / 2 
XX = V (JTWE) 

IF V (JNC) >= XX GOTO 730 
JNC = JNC + 1 
GOTO 700 

IF XX >= V ( J) GOTO 760 
J = J - 1 
GOTO 730 

IF JNC > J GOTO 806 
AMM = V (JNC) 

WTEMP = W (JNC) 

V (JNC) = V(J) 

W (JNC) = W(J) 

V(J) = AMM 
W(J) = WTEMP 
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800 JNC = JNC + 1 

804 J = J - 1 

806 IF JNC <= J GOTO 700 

808 IF (J - JNDL) < (JR - JNC) GOTO 822 

810 IF JNDL >= J GOTO 818 

812 JSS = JSS + 1 

814 JLV(JSS) = JNDL 

816 JRV(JSS) = J 

818 JNDL = JNC 

820 GOTO 832 

822 IF JNC >= JR GOTO 830 
824 JSS = JSS + 1 
826 JLV(JSS) = JNC 
828 JRV(JSS) = JR 
830 JR = J 

832 IF JNDL < JR GOTO 660 

834 IF JSS <> 0 GOTO 630 

850 SUMW =0.0 

852 WCAP = (NOBS+1) / 2 

854 FOR I = 1 TO NTOT 

856 IF SUMW >= WCAP GOTO 870 

858 SUMW = SUMW + W(I) 

860 NEXT I 

870 REMEDIAN = V(I-l) 

880 PRINT #2, "REMEDIAN =" ; REMEDIAN 
890 END 

970 FUNCTION FMED (Y()) 

975 NH = 9 
980 LL = 1 
990 LR = 17 

1000 IF LL >= LR GOTO 1210 
1010 AX = Y (NH) 

1020 JNC = LL 
1030 J = LR 

1040 IF JNC > J GOTO 1180 
1050 IF Y (JNC) >= AX GOTO 1080 
1060 JNC = JNC + 1 
1070 GOTO 1050 

1080 IF Y ( J) <= AX GOTO 1110 

1090 J = J - 1 

1100 GOTO 1080 

1110 IF JNC > J GOTO 1170 

1120 WA = Y (JNC) 

1130 Y (JNC) = Y ( J) 

1140 Y ( J) = WA 
1150 JNC = JNC + 1 
1160 J = J - 1 
1170 GOTO 1040 

1180 IF J < NH THEN LL = JNC 
1190 IF NH < JNC THEN LR = J 
1200 GOTO 1000 
1210 FMED = Y (NH) 

1220 END FUNCTION 
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Appendix B. A Basic Program for the Sn Statistic 



REM THIS PROGRAM COMPUTES THE ROBUST DISPERSION ESTIMATE, 
REM Sn. THE FUNCTIONS AND SUBROUTINES IN THIS PROGRAM 
REM ARE ADAPTED FROM THE FORTRAN CODE "SN.FOR" 

REM CREATED BY PETER J. ROUSSEEUW AND HIS 
REM COLLEAGUES. THE THEORY AND STATISTICAL PROPERTIES 
REM OF THE ROBUST SCALE ESTIMATE SN ARE DISCUSSED 
REM IN THE FOLLOWING PAPER: 

REM 

REM P. J. ROUSSEEUW AND C. CROUX (1993) : 

REM "ALTERNATIVES TO THE MEDIAN ABSOLUTE DEVIATION, " 
REM JOURNAL OF THE AMERICAN STATISTICAL ASSOCIATION, 

REM 88, 1273-1283. 

REM 

REM THERE IS NO WARRANTY, EXPRESSED OR IMPLIED, FOR THIS 
REM PROGRAM. ITS SUITABILITY FOR COMMERCIAL USE OR FOR 
REM ANY PARTICULAR PURPOSE IS NOT GUARANTEED. 

REM 

005 DEFLNG I-N 

010 DECLARE FUNCTION SCALE (Y(),N) 

015 DECLARE FUNCTION FMED (Y ( ) , N) 

020 DECLARE SUB SORT(Y(),N) 

025 PRINT "WHAT IS THE INPUT FILE ?" 

030 INPUT INFILE$ 

035 PRINT "WHAT IS THE OUTPUT FILE ? " 

040 INPUT OUTFILE$ 

045 PRINT "HOW MANY OBSERVATIONS ?" 

050 INPUT N 

055 OPEN INFILE$ FOR INPUT AS #1 
060 OPEN OUTFILE$ FOR OUTPUT AS #2 
065 DIM X (N) 

070 FOR I = 1 TO N 
075 INPUT #1, X(I) 

085 NEXT I 

086 SN = SCALE (X(),N) 

088 PRINT #2, "ROBUST SCALE SN = " , SN 

090 END 



1500 FUNCTION SCALE(Y(),N) 
1502 DIM A2 (N) 

1504 CALL SORT (Y ( ) , N) 

1506 A2(l) = Y (N\2+l) -Y (1) 
1508 FOR I = 2 TO (N+l)\2 
1510 NA=I-1 
1512 NB=N- I 
1514 IDIFF=NB-NA 
1516 LEFTA=1 
1518 LEFTB=1 
1520 IRIGHTA=NB 
1522 IRIGHTB=NB 
1524 IAMIN=IDIFF\2+1 
1526 IAMAX=IDIFF\2+NA 
1528 REM 
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1530 IF LEFTA < I RIGHT A THEN 
1532 LENGTH=IRIGHTA-LEFTA+1 
1534 IEVEN=1- (LENGTH MOD 2) 

1536 IHALF= (LENGTH- 1) \2 
1538 ITRYA=LEFTA+IHALF 
1540 ITRYB=LEFTB+IHALF 
1542 IF ITRYA < I AMIN THEN 
1544 IRIGHTB=ITRYB 
1546 LEFTA=ITRYA+IEVEN 
1548 ELSE 

1550 IF ITRYA > IAMAX THEN 
1552 IRIGHTA=ITRYA 
1554 LEFTB=ITRYB+IEVEN 
1556 ELSE 

1558 RMEDA=Y (I) -Y (I-ITRYA+IAMIN-1) 
1560 RMEDB=Y (ITRYB+I) -Y (I) 

1562 IF RMEDA >= RMEDB THEN 
1564 IRIGHTA=ITRYA 
1566 LEFTB=ITRYB+IEVEN 
1568 ELSE 

1570 IRIGHTB=ITRYB 

1572 LE FTA= I TRYA+ IE VEN 

1574 END IF 

1576 END IF 

1578 END IF 

1580 GOTO 1528 

1582 END IF 

1584 IF LEFTA > IAMAX THEN 
1586 A2(I) = Y (LEFTB+I) -Y (I) 

1588 ELSE 

1600 RMEDA=Y(I) -Y (I-LEFTA+IAMIN- 1) 
1602 RMEDB=Y (LEFTB+I) -Y (I) 

1604 A2 (I) = RMEDA 

1606 IF RMEDB < RMEDA THEN 

1608 A2 (I) = RMEDB 

1610 END IF 

1612 END IF 

1614 NEXT I 

1616 FOR I = (N+l) \2+l TO N-l 

1618 NA=N-I 

1620 NB=I - 1 

1622 IDIFF=NB-NA 

1624 LEFTA=1 

1626 LEFTB=1 

1628 IRIGHTA=NB 

1630 IRIGHTB=NB 

1632 IAMIN=IDIFF\2+1 

1634 IAMAX=IDIFF\2+NA 

1636 REM 

1638 IF LEFTA < IRIGHTA THEN 
1640 LENGTH= IRIGHTA - L E FT A+ 1 
1642 IEVEN=1- (LENGTH MOD 2) 

1644 IHALF= (LENGTH- 1 ) \2 
1646 ITRYA=LEFTA+IHALF 
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1648 ITRYB=LEFTB+IHALF 
1650 IF ITRYA < I AMIN THEN 
1652 IRIGHTB=ITRYB 
1654 LEFTA=ITRYA+IEVEN 
1656 ELSE 

1658 IF ITRYA > IAMAX THEN 
1660 IRIGHTA= ITRYA 
1662 LEFTB=ITRYB+IEVEN 
1664 ELSE 

1666 RMEDA=Y (I+ITRYA- IAMIN+1) -Y (I) 

1668 RMEDB=Y (I) -Y (I-ITRYB) 

1670 IF RMEDA >= RMEDB THEN 
1672 IRIGHTA= ITRYA 
1674 LEFTB=ITRYB+IEVEN 
1676 ELSE 

1678 IRIGHTB=ITRYB 

1680 LEFTA= ITRYA+ IEVEN 

1682 END IF 

1684 END IF 

1686 END IF 

1688 GOTO 1636 

1690 END IF 

1692 IF LEFTA > IAMAX THEN 
1694 A2 (I) =Y (I) -Y (I-LEFTB) 

1696 ELSE 

1698 RMEDA=Y (I+LEFTA- IAMIN+1) -Y(I) 

1700 RMEDB=Y (I) -Y (I-LEFTB) 

1702 A2 (I) = RMEDA 

1704 IF RMEDB < RMEDA THEN 

1706 A2 (I) = RMEDB 

1708 END IF 

1710 END IF 

1712 NEXT I 

1714 A2 (N) =Y(N) -Y( (N+l) \2) 

1716 CN=1 

1718 IF N <= 9 THEN 
1720 IF N = 2 THEN CN = 0.743 

1722 IF N = 3 THEN CN = 1.851 

1724 IF N = 4 THEN CN = 0.954 

1726 IF N = 5 THEN CN = 1.351 

1728 IF N = 6 THEN CN = 0.993 

1730 IF N = 7 THEN CN = 1.198 

1734 IF N = 8 THEN CN = 1.005 

1736 IF N = 9 THEN CN = 1.131 

1738 ELSE 

1740 IF (N MOD 2) = 1 THEN CN = N/(N-0.9) 
1742 END IF 

1744 SCALE = CN*1 . 1926*FMED (A2 () ,N) 

1746 END FUNCTION 
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594 SUB SORT (Y ( ) ,N) 

596 DEFLNG I-N 

598 DIM JLV(N), JRV (N) 

600 JSS = 1 
610 JLV(l) = 1 
620 JRV ( 1 ) = N 
630 JNDL = JLV(JSS) 

640 JR = JRV (JSS) 

650 JSS = JSS - 1 
660 JNC = JNDL 
670 J = JR 

680 JTWE = (JNDL + JR) \ 2 
690 XX = Y (JTWE) 

700 IF Y (JNC) >= XX GOTO 730 
710 JNC = JNC + 1 
720 GOTO 700 

730 IF XX >= Y ( J) GOTO 760 

740 J = J - 1 

750 GOTO 730 

760 IF JNC > J GOTO 806 

770 AMM = Y (JNC) 

780 Y (JNC) = Y ( J) 

790 Y ( J) = AMM 

800 JNC = JNC + 1 

804 J = J - 1 

806 IF JNC <= J GOTO 700 

808 IF (J - JNDL) < (JR - JNC) GOTO 822 

810 IF JNDL >= J GOTO 818 

812 JSS = JSS + 1 

814 JLV(JSS) = JNDL 

816 JRV (JSS) = J 

818 JNDL = JNC 

820 GOTO 832 

822 IF JNC >= JR GOTO 830 
824 JSS = JSS + 1 
826 JLV(JSS) = JNC 
828 JRV (JSS) = JR 
830 JR = J 

832 IF JNDL < JR GOTO 660 
834 IF JSS <> 0 GOTO 630 
836 END SUB 

970 FUNCTION FMED (Y(),N) 

975 DEFLNG I-N 

980 LL = 1 

990 LR = N 

995 NH = (N+l) \2 

1000 IF LL >= LR GOTO 1210 

1010 AX = Y (NH) 

1020 JNC = LL 
1030 J = LR 

1040 IF JNC > J GOTO 1180 
1050 IF Y (JNC) >= AX GOTO 1080 
1060 JNC = JNC + 1 




1070 GOTO 1050 

1080 IF Y(J) <= AX GOTO 1110 

1090 J = J - 1 

1100 GOTO 1080 

1110 IF JNC > J GOTO 1170 

1120 WA = Y (JNC) 

1130 Y (JNC) = Y ( J) 

1140 Y(J) = WA 
1150 JNC = JNC + 1 
1160 J = J - 1 
1170 GOTO 1040 

1180 IF J < NH THEN LL = JNC 
1190 IF NH < JNC THEN LR = J 
1200 GOTO 1000 
1210 FMED = Y (NH) 

1220 END FUNCTION 
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